-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Add remove sparse features preprocessor #4093
Conversation
I like the preprocessor, but do we have a better name for it? :) I didn't realize what it does before seeing the code. |
Codecov Report
@@ Coverage Diff @@
## master #4093 +/- ##
==========================================
+ Coverage 85.75% 85.76% +0.01%
==========================================
Files 389 389
Lines 69839 69899 +60
==========================================
+ Hits 59892 59952 +60
Misses 9947 9947 |
Orange/preprocess/preprocess.py
Outdated
""" | ||
|
||
def __init__(self, treshold=0.05): | ||
self.treshold = treshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to be this person, but perhaps for clarity and for maintaining a good API, I would correct the name to threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. This, along with length is in my top misspelled words.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always write 'widget' when I want to write 'width'. 🤷♀
I would support "sparse feature" in the name of the widget. One reason is that I cannot come up with a better one, and second because this term has been used in this context. For instance, googling '"sparse features" machine learning' returns 83.000 pages. |
This works well also on bag of words sparse type. I was wondering would it make sense to have also an int option. As in Max instances with nonzero values... 🤔 In the case of text mining this would then be similar to document frequency filter. |
Sounds good, we can add that in the second iteration of this preprocessor. |
@AndrejaKovacic, I added an icon. I reused something we already had in "Data". What do you think? |
I think it's fitting, better than select column. |
@AndrejaKovacic, if you pull master and rebase, it should no longer crash on SOM. I apologize for inconvenience. :) This is assigned to @BlazZupan. If UX is confirmed, I can check the code and merge. |
Orange/preprocess/preprocess.py
Outdated
h, w = data_csc.shape | ||
sparsness = [data_csc[:, i].count_nonzero()/h for i in range(w)] | ||
else: | ||
sparsness = np.count_nonzero(data.X, axis=0)/ data.X.shape[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because you'll be rebasing and pushing anyway, you can also add a space before /
, so it's balanced. Bonus points for putting spaces before and after /
two lines above.
Orange/widgets/data/owpreprocess.py
Outdated
self.changed.emit() | ||
|
||
def parameters(self): | ||
return {'sparse_thresh':self.sparse_thresh} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A space after :
would be nice, too.
Orange/widgets/data/owpreprocess.py
Outdated
def createinstance(params): | ||
params = dict(params) | ||
threshold = params.pop('sparse_thresh', 5) | ||
return RemoveSparse(threshold=threshold/100) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And just a pair of spaces here and I'm merging - when it's rebased so it passes SOM.
5bebeb5
to
eb9a6cd
Compare
Issue
Description of changes
Preprocessor widget now enables removing sparse features. Threshold is up to the user. I thought we could use the same icon as feature selection and cur decomposition.
Includes